Exercise 1: Preparation

In the load_data chunk we need to load all the data we will need for the analysis. We have 2 data sources:

The data is stored respectively in 3 data.frames: df.design, df.metabolomics.raw and df.metabolomics.design.

NOTE: An important difference compared to the tutorial file is that df.metabolomics.design is already joined with df.design.

Exercise 1.1

First thing to do when encountering a new dataset is to look at it’s shape and properties.

## Observations: 28,950
## Variables: 11
## $ bucket          <chr> "344.07941 Da 18.36 s", "344.07941 Da 18.36 s", "344.…
## $ RT              <dbl> 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.63, 0.63, 0.63,…
## $ mz_ratio        <dbl> 345.0867, 345.0867, 345.0867, 345.0867, 345.0867, 345…
## $ Name            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Formula         <chr> "C20H12N2O4", "C20H12N2O4", "C20H12N2O4", "C20H12N2O4…
## $ phase           <chr> "apolar", "apolar", "apolar", "apolar", "apolar", "ap…
## $ mass            <dbl> 344.0794, 344.0794, 344.0794, 344.0794, 344.0794, 344…
## $ charge          <dbl> 0.9970811, 0.9970811, 0.9970811, 0.9970811, 0.9970811…
## $ metabolomics_id <chr> "202002_TTP_SILS2665_RA_1_1 _merged_", "202002_TTP_SI…
## $ value           <dbl> 10066, 9210, 9564, 8882, 9346, 8106, 3996, 4116, 4132…
## $ value_log10     <dbl> 4.002900, 3.964307, 3.980685, 3.948560, 3.970672, 3.9…

Question 1.1

What is in data.frame and what do you think is the meaning of each column? Is the data in a tidy format?

Feature Statistics

As you might have notice, some of the features have a formula and/or a label associated to it but other don’t. In addition, there are features that have the same formula and/or label. Create a figure showing the percentage of features with a formula and with a name for each phase and a table showing the total number of features and with names and formulas and how many of those are unique for each phase.

phase total_count n_formula n_uniq_formula n_label n_uniq_label
apolar 11742 8460 1237 108 16
polar 17208 8718 1240 660 89

Distribution of Raw Values

Number of Zeros

Total Intensity

MA Plots

Normalizing

df.metabolomics <- df.metabolomics.raw %>%
  # remove metabolites with only 0 values
  group_by(phase, bucket) %>%
  filter(sum(value) > 0) %>%
  mutate(
    # replace zero values by half of the smallest observed value for that metabolite
    value_cor = ifelse(value <= 0, min(value[value > 0], na.rm = T) / 2, value),
    # calculate reference per metabolite
    pseudo_reference = ifelse(any(value_cor == 0), NA, mean(log(value_cor)))
  ) %>%
  group_by(phase, metabolomics_id) %>%
  mutate(
    # calculate median size factor per sample
    size_factor = exp(median(log(value_cor) - pseudo_reference, na.rm=T)),
    # correct value with size factor
    value_norm = value_cor / size_factor,
    value_norm_log10 = log10(value_norm)
  ) %>%
  # in this report I assume the data.frame to be grouped by phase only, so I set this grouping explicitly
  group_by(phase)

Total Intensity

MA Plots

PCA

df.pca <- df.metabolomics %>%
  mutate(
    value_log_norm = log10(value_norm)
  ) %>%
  select(phase, bucket, metabolomics_id, value_log_norm) %>%
  filter(!is.na(bucket)) %>%
  group_by(phase) %>%
  f_pca(features_from = "bucket", counts_from = "value_log_norm", samples_from = "metabolomics_id")

Loadings of PC 1

To extract the top- and bottom- 10 loadings we leverage the power of our nested data.frame and map:

df.pca.loadings <- df.pca %>% 
  mutate(
    top_10_pc_1 = map(pca_loadings, function(.d) {
      .d %>% filter(PC == 1) %>% arrange(-loading) %>% slice(1:10)
    }),
    bottom_10_pc_1 =  map(pca_loadings, function(.d) {
      .d %>% filter(PC == 1) %>% arrange(loading) %>% slice(1:10)
    })
  )

Most down-regulated metabolites of PC 1:

phase bucket loading
apolar 870.56984 Da 803.03 s -0.1827206
apolar 892.54105 Da 803.03 s -0.1272318
apolar 551.42610 Da 842.78 s -0.1212339
apolar 886.56398 Da 850.98 s -0.0992992
apolar 396.00441 Da 41.68 s -0.0990238
apolar 792.56643 Da 505.34 s -0.0989117
apolar 349.96167 Da 41.67 s -0.0960333
apolar 1052.52432 Da 807.07 s -0.0838531
apolar 381.98812 Da 41.95 s -0.0800602
apolar 335.94565 Da 41.39 s -0.0766024
polar 202.16809 Da 35.67 s -0.0919538
polar 545.16130 Da 170.88 s -0.0792119
polar 551.17860 Da 52.61 s -0.0760986
polar 1389.01032 Da 439.32 s -0.0755749
polar 279.13232 Da 47.75 s -0.0752341
polar 186.01639 Da 214.63 s -0.0749046
polar 505.17065 Da 53.87 s -0.0695598
polar 1275.92760 Da 432.91 s -0.0695352
polar 143.09457 Da 35.95 s -0.0663302
polar 153.98977 Da 215.00 s -0.0651861

Most up-regulated metabolites of PC 1:

phase bucket loading
apolar 195.05665 Da 40.87 s 0.2397694
apolar 412.09570 Da 40.31 s 0.1526381
apolar 175.96327 Da 18.43 s 0.1121804
apolar 467.00597 Da 39.82 s 0.1100497
apolar 131.97352 Da 18.49 s 0.0958542
apolar 243.95081 Da 24.01 s 0.0843453
apolar 379.92616 Da 20.53 s 0.0822617
apolar 311.93825 Da 20.61 s 0.0814280
apolar 243.95122 Da 70.97 s 0.0779820
apolar 379.92574 Da 33.46 s 0.0734391
polar 195.05672 Da 46.20 s 0.1721499
polar 780.22632 Da 47.97 s 0.1497580
polar 684.08708 Da 50.27 s 0.1309272
polar 330.11082 Da 47.20 s 0.1175668
polar 443.02423 Da 50.64 s 0.0849878
polar 537.17417 Da 52.41 s 0.0833498
polar 218.04115 Da 45.53 s 0.0828316
polar 343.95574 Da 45.53 s 0.0788116
polar 342.11236 Da 47.50 s 0.0745676
polar 507.04174 Da 50.76 s 0.0725621

Heatmap

Another useful tool is to plot the data in a heatmap.

This approach takes

df.heatmap <- right_join(
  # take our normalised values (these will be shown in the heatmap)
  df.metabolomics %>% select(phase, bucket, metabolomics_id, value_norm),
  # extract the top- and bottom-10 from the nested data.frame
  # if a bucket is in top- or bottom-10 it will be retained in our joined data frame
  bind_rows(
    df.pca.loadings %>% unnest(cols = top_10_pc_1) %>% select(phase, bucket, loading),
    df.pca.loadings %>% unnest(cols = bottom_10_pc_1) %>% select(phase, bucket, loading)
  ),
  by = c("phase", "bucket")
) %>%
  # add design information
  left_join(
    df.metabolomics.design %>% select(sample_id, metabolomics_id, purpose, phosphate_mM),
    by= c("metabolomics_id")
  ) %>%
  select(phase, bucket, sample_id, value_norm) %>%
  pivot_wider(
    names_from = sample_id, 
    values_from = value_norm
  ) %>%
  ungroup()

T-test

Volcano Plot

We can generalize our volcano plot code into a function to avoid repeating the same code over and over…

f_plot_volcano <- function(.d, .title = "Volcano Plot") {
  
  .d %>% 
    ggplot(aes(x = logFC, y = -log10(padj), colour = threshold)) +
    facet_wrap(~phase) +
    geom_point(alpha = 0.5) +
    geom_vline(xintercept = 0, colour = "black") +
    geom_vline(xintercept = c(-fold.change.threshold, fold.change.threshold), colour = "red") +
    geom_hline(yintercept = -log10(alpha), colour = "red") +
    geom_text_repel(
      aes(x = logFC, y = -log10(padj), label = bucket), 
      # feature of ggplot; it applies the function to the overall dataset -> only label those which breach our thresholds
      data = function(.d) { .d %>% filter(threshold) },
      colour = "grey20"
    ) +
    theme(legend.position = "none") +
    labs(
      title = .title, 
      x = "Effect size: log2(fold-change)", 
      y = "-log10(adjusted p-value)"
    )
  
}

List of Significant Metabolites

Finally print a table with all interesting buckets.

Polar Phase

Metabolites with adj P-value < 0.01 and absolute log ratio > 2
phase bucket Name Formula logFC p.value padj
polar 1389.01032 Da 439.32 s NA C55H112N44 -4.404664 0.0000000 0.0000000
polar 1275.92760 Da 432.91 s NA C70H134NO14PS -4.063928 0.0000000 0.0000000
polar 505.17065 Da 53.87 s NA C13H36N3O13PS -4.115039 0.0000000 0.0000000
polar 113.08366 Da 47.02 s NA NA 3.450712 0.0000000 0.0000000
polar 248.97570 Da 51.41 s NA NA 3.866483 0.0000000 0.0000000
polar 162.05319 Da 53.18 s NA C6H10O5 -2.891696 0.0000000 0.0000000
polar 99.06825 Da 48.34 s NA C5H9NO 3.982592 0.0000000 0.0000000
polar 324.10542 Da 52.62 s NA C17H16N4OS -2.856964 0.0000000 0.0000000
polar 557.08069 Da 53.14 s NA NA -2.824540 0.0000000 0.0000000
polar 135.05421 Da 46.92 s NA NA -3.754185 0.0000000 0.0000000
polar 653.80118 Da 35.63 s NA C15H6N6O8S8 -2.730258 0.0000000 0.0000000
polar 162.05294 Da 48.25 s NA C6H10O5 -3.936065 0.0000000 0.0000000
polar 307.01122 Da 49.47 s NA C8H9N3O8S -2.768852 0.0000000 0.0000000
polar 118.02789 Da 46.85 s NA C5H2N4 -2.308795 0.0000000 0.0000000
polar 218.04115 Da 45.53 s NA C5H10N6S2 4.687774 0.0000000 0.0000000
polar 195.05672 Da 46.20 s 2-_N-morpholino_ethanesulfonic acid C6H13NO4S 9.770304 0.0000000 0.0000000
polar 281.91637 Da 34.82 s NA C8H2N4S4 -2.635089 0.0000000 0.0000000
polar 684.08708 Da 50.27 s NA C19H28N10O10S4 7.320491 0.0000000 0.0000000
polar 851.81735 Da 35.69 s NA NA 2.225573 0.0000000 0.0000000
polar 443.02423 Da 50.64 s NA C21H10N5O3PS 4.805384 0.0000000 0.0000000
polar 54.04650 Da 51.02 s NA NA -2.001179 0.0000000 0.0000000
polar 80.96480 Da 48.87 s NA NA 3.110823 0.0000000 0.0000000
polar 551.17860 Da 52.61 s NA C24H42NO3PS4 -4.414098 0.0000000 0.0000000
polar 182.20346 Da 1169.50 s NA C13H26 -2.063661 0.0000000 0.0000000
polar 330.11082 Da 47.20 s NA C19H14N4O2 6.650154 0.0000000 0.0000000
polar 666.22332 Da 65.72 s NA C17H44N6O17P2 -2.052050 0.0000000 0.0000000
polar 153.98977 Da 215.00 s NA C6H2O5 -3.738726 0.0000000 0.0000000
polar 135.05435 Da 47.06 s Adenine C5H5N5 -2.358729 0.0000000 0.0000000
polar 468.37287 Da 1179.28 s NA C28H53O3P -3.145643 0.0000000 0.0000000
polar 505.42821 Da 1174.94 s NA C35H55NO -2.193067 0.0000000 0.0000000
polar 254.09980 Da 49.53 s NA C9H18O8 -3.317741 0.0000000 0.0000000
polar 491.44838 Da 1179.75 s NA C27H61N3O2S -2.666261 0.0000000 0.0000000
polar 202.16809 Da 35.67 s NA C10H22N2O2 -2.702719 0.0000000 0.0000000
polar 64.00597 Da 46.78 s NA NA -2.367665 0.0000000 0.0000000
polar 115.90590 Da 28.82 s NA NA 2.108976 0.0000000 0.0000000
polar 585.17066 Da 47.79 s NA C19H35N7O8S3 2.019161 0.0000000 0.0000000
polar 279.13232 Da 47.75 s NA C12H25NO2S2 -4.373231 0.0000000 0.0000000
polar 143.09457 Da 35.95 s NA C7H13NO2 -2.340761 0.0000000 0.0000000
polar 390.11373 Da 47.79 s NA C12H18N6O9 2.049031 0.0000000 0.0000000
polar 302.06080 Da 66.17 s NA C8H18N2O6S2 2.167386 0.0000000 0.0000000
polar 211.05125 Da 67.62 s NA C6H13NO5S 2.305008 0.0000000 0.0000001
polar 233.03331 Da 68.81 s NA C12H11NS2 2.310754 0.0000000 0.0000001
polar 780.22632 Da 47.97 s NA C27H53N6O8PS5 9.068171 0.0000000 0.0000001
polar 324.10409 Da 67.10 s NA C17H24S3 -2.232661 0.0000000 0.0000001
polar 309.96791 Da 67.81 s NA C8H3N6O4PS 2.340419 0.0000000 0.0000001
polar 418.57380 Da 48.36 s NA NA 2.475847 0.0000000 0.0000002
polar 255.96926 Da 300.67 s NA NA 2.381188 0.0000000 0.0000002
polar 440.14031 Da 52.94 s NA C15H20N8O8 2.408047 0.0000000 0.0000003
polar 636.12403 Da 101.36 s NA C17H44N6O3S8 -2.155888 0.0000000 0.0000003
polar 186.01639 Da 214.63 s NA C7H6O6 -4.430408 0.0000000 0.0000004
polar 312.13616 Da 50.15 s NA C19H20O4 2.482399 0.0000001 0.0000007
polar 975.28149 Da 48.06 s NA C33H49N23OS6 2.621164 0.0000001 0.0000010
polar 657.89435 Da 36.06 s NA C20H7N10O5PS5 2.118107 0.0000001 0.0000010
polar 86.03658 Da 36.94 s NA NA -2.371961 0.0000001 0.0000010
polar 85.05253 Da 37.75 s NA C4H7NO -2.292786 0.0000001 0.0000010
polar 211.05153 Da 67.98 s NA C6H13NO5S 2.909145 0.0000003 0.0000023
polar 195.45403 Da 48.06 s NA NA 2.907084 0.0000004 0.0000028
polar 310.12002 Da 45.46 s NA C11H22N2O6S 2.742540 0.0000006 0.0000039
polar 195.05656 Da 124.38 s NA C6H13NO4S 2.920660 0.0000006 0.0000041
polar 424.11927 Da 45.00 s NA C13H32N2O5S4 3.111636 0.0000007 0.0000043
polar 342.11236 Da 47.50 s NA C19H18O6 5.026800 0.0000007 0.0000046
polar 313.56890 Da 46.09 s NA NA 3.155939 0.0000009 0.0000057
polar 753.53621 Da 865.22 s NA NA -2.903183 0.0000009 0.0000059
polar 72.02134 Da 43.52 s NA C3H4O2 2.415400 0.0000011 0.0000067
polar 390.67583 Da 47.24 s NA NA 3.298686 0.0000012 0.0000072
polar 732.22414 Da 45.42 s NA C19H41N16O7PS3 3.223633 0.0000015 0.0000092
polar 107.98852 Da 67.88 s NA NA 3.461934 0.0000016 0.0000097
polar 197.03605 Da 160.38 s NA NA 3.401431 0.0000018 0.0000107
polar 537.17417 Da 52.41 s NA C13H40N5O9PS3 5.521678 0.0000018 0.0000109
polar 684.08553 Da 49.48 s NA C25H9N20O4P 3.583969 0.0000019 0.0000113
polar 192.02681 Da 83.77 s Isocitric acid C6H8O7 -2.705202 0.0000024 0.0000139
polar 626.16820 Da 185.51 s NA C19H47O12PS4 3.265538 0.0000025 0.0000142
polar 569.14799 Da 188.92 s NA C27H27N3O9S 3.538897 0.0000041 0.0000222
polar 293.11182 Da 74.41 s NA C12H23NO3S2 -2.212128 0.0000050 0.0000262
polar 474.01089 Da 47.04 s NA NA -2.037417 0.0000053 0.0000279
polar 424.14514 Da 52.69 s NA C16H32N4OS4 3.735806 0.0000064 0.0000330
polar 424.14674 Da 52.13 s NA C13H33N2O7PS2 4.570111 0.0000070 0.0000356
polar 507.04174 Da 50.76 s NA C21H5N11O6 4.651619 0.0000071 0.0000358
polar 549.85850 Da 29.05 s NA NA 2.026331 0.0000080 0.0000401
polar 135.98121 Da 29.42 s NA NA 2.270894 0.0000086 0.0000428
polar 153.12665 Da 47.66 s NA C8H15N3 -2.111851 0.0000128 0.0000614
polar 521.19632 Da 52.29 s NA C32H23N7O -2.246412 0.0000130 0.0000618
polar 335.08973 Da 101.06 s NA C9H23NO8P2 -2.014927 0.0000185 0.0000855
polar 535.03645 Da 50.76 s NA C16H6N15O6P 3.907807 0.0000206 0.0000924
polar 225.03063 Da 215.72 s NA C6H11NO6S 2.178694 0.0000227 0.0000990
polar 187.19382 Da 422.38 s NA C11H25NO 3.732177 0.0000287 0.0001207
polar 330.11118 Da 45.92 s NA C19H22OS2 3.572680 0.0000300 0.0001246
polar 195.07332 Da 46.94 s NA NA 4.865156 0.0000347 0.0001397
polar 553.82966 Da 26.74 s NA NA 3.671634 0.0000361 0.0001438
polar 128.04615 Da 784.34 s NA NA -2.372932 0.0000529 0.0001990
polar 256.94792 Da 41.13 s NA C5H7NO5S3 2.107670 0.0000536 0.0002017
polar 429.08512 Da 235.88 s NA C24H15NO7 -3.966080 0.0000562 0.0002090
polar 228.10056 Da 337.15 s NA C12H20S2 -2.111015 0.0000576 0.0002128
polar 689.03853 Da 100.75 s NA NA -2.260020 0.0000633 0.0002308
polar 271.09782 Da 40.14 s NA C12H18NO4P -2.046353 0.0000674 0.0002434
polar 356.12601 Da 54.83 s NA C20H20O6 5.723925 0.0000786 0.0002778
polar 343.95574 Da 45.53 s NA NA 4.570205 0.0000856 0.0002987
polar 315.10435 Da 40.18 s NA C16H17N3O2S 3.445711 0.0001049 0.0003585
polar 565.83665 Da 29.00 s NA C13H3N4O12PS4 2.332420 0.0001126 0.0003827
polar 406.27013 Da 988.50 s NA C13H34N12OS -2.724868 0.0001250 0.0004199
polar 353.01730 Da 49.08 s NA C10H15N3O5S3 -2.600020 0.0001300 0.0004334
polar 390.95388 Da 43.93 s NA C10H9N5O4S4 2.018588 0.0001491 0.0004882
polar 197.11625 Da 49.80 s NA C9H15N3O2 -2.151283 0.0001495 0.0004890
polar 107.98851 Da 151.00 s NA NA 3.289306 0.0001550 0.0005045
polar 582.17084 Da 54.22 s NA C36H27N2O4P 2.037894 0.0001676 0.0005382
polar 81.98175 Da 1174.56 s NA NA -3.701889 0.0001720 0.0005488
polar 407.90954 Da 42.95 s NA NA 2.751582 0.0002041 0.0006399
polar 594.18444 Da 187.01 s NA C43H22N4 -2.119586 0.0002277 0.0007060
polar 215.11621 Da 403.33 s NA C10H17NO4 -2.767454 0.0003425 0.0010097
polar 576.13728 Da 560.21 s NA NA -2.247904 0.0003452 0.0010165
polar 56.06225 Da 1178.52 s NA NA -2.424346 0.0004098 0.0011752
polar 159.12589 Da 38.85 s NA NA -2.345349 0.0004291 0.0012246
polar 351.08872 Da 183.84 s NA C7H13N9O8 -3.479490 0.0005237 0.0014622
polar 270.14690 Da 581.44 s NA C14H22O5 -2.152940 0.0008794 0.0023354
polar 327.56754 Da 50.04 s NA NA 2.483253 0.0010959 0.0028548
polar 143.16765 Da 179.80 s NA C9H21N 3.126743 0.0014405 0.0036369
polar 162.99232 Da 30.05 s NA NA 3.712654 0.0017044 0.0041816
polar 130.99596 Da 29.67 s NA NA 2.865890 0.0019884 0.0048043
polar 174.53774 Da 211.84 s NA NA -2.387574 0.0026957 0.0063143
polar 311.55770 Da 52.79 s NA NA 2.568152 0.0027307 0.0063777
polar 835.84772 Da 33.22 s NA C25H14O21P6 3.225463 0.0029057 0.0067014
polar 774.52839 Da 1183.19 s NA C32H75N10O7PS 2.758175 0.0033151 0.0075219
polar 641.08784 Da 50.18 s NA C14H32N11O6PS5 3.388157 0.0034151 0.0077244
polar 312.26765 Da 1181.50 s NA NA 2.938936 0.0042354 0.0092726
polar 611.17805 Da 1184.30 s NA C28H38NO8PS2 3.706555 0.0042868 0.0093637

Apolar Phase

Metabolites with adj P-value < 0.01 and absolute log ratio > 2
phase bucket Name Formula logFC p.value padj
apolar 467.00597 Da 39.82 s NA C20H9N3O9S 4.091001 0.0000000 0.0000000
apolar 1052.52432 Da 807.07 s NA C65H84N2S5 -2.031740 0.0000000 0.0000000
apolar 792.56643 Da 505.34 s NA NA -2.786192 0.0000000 0.0000000
apolar 870.56984 Da 803.03 s NA C51H84O7P2 -6.288501 0.0000000 0.0000001
apolar 191.04414 Da 41.28 s NA NA 2.316799 0.0000000 0.0000002
apolar 892.54105 Da 803.03 s NA C52H81N2O4PS2 -4.097175 0.0000001 0.0000004
apolar 244.90652 Da 77.09 s NA NA -2.088680 0.0000001 0.0000009
apolar 195.05665 Da 40.87 s 2-_N-morpholino_ethanesulfonic acid C6H13NO4S 8.717947 0.0000003 0.0000020
apolar 412.09570 Da 40.31 s NA C11H24N8OS4 5.281069 0.0000017 0.0000080
apolar 349.96167 Da 41.67 s NA NA -2.885525 0.0000252 0.0000832
apolar 886.56398 Da 850.98 s NA C34H66N26OS -2.770792 0.0000922 0.0002589
apolar 396.00441 Da 41.68 s NA NA -2.197764 0.0001141 0.0003107
apolar 381.98812 Da 41.95 s NA C18H7O8P -2.699859 0.0001375 0.0003630
apolar 652.36412 Da 499.78 s NA C25H58N4O11P2 2.075934 0.0002266 0.0005636
apolar 551.42610 Da 842.78 s NA NA -3.191314 0.0003743 0.0008795
apolar 335.94565 Da 41.39 s NA NA -2.081601 0.0007243 0.0015716

Export

## `mutate_if()` ignored the following grouping variables:
## Columns `phase`, `bucket`

T-test results were exported to ./data/processed/metabolomics/20200311_metabolomics_ttest.csv